Churn prediction is the task of identifying whether users are likely to stop using a service, product, or website. Companies and large corporations strive to acquire new customers, a process which usually requires substantial time and money, and they will do everything to prevent an existing customer from leaving. Acquiring a new customer is often much more costly than making an offer to a client who is about to leave. More importantly, customers rarely reveal their complaints about a service or a product they use, or their intention to leave a company. Therefore, it is important for companies to regularly review historical user behavior patterns and accurately forecast the probability that a given customer may churn.
Furthermore, companies usually make a distinction between voluntary churn and involuntary churn. Voluntary churn occurs when the customer decides to switch to another company or service provider, whereas involuntary churn occurs due to circumstances such as a customer's death, relocation to a long-term care facility, or move to a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn, because it typically occurs due to factors of the company-customer relationship which companies control, such as how billing interactions are handled or how after-sales help is provided.
Here, we are going to study a transactional data set which contains all the transactions occurring between December 1st, 2010 (01/12/2010) and December 9th, 2011 (09/12/2011) for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. The data set is provided through the UCI Machine Learning Repository (Online Retail), and our task will be to predict which customers are likely to churn given their purchasing activity. We will assume that none of these customers stopped buying from this online retailer for involuntary reasons, and that those who did stop intended to actually leave the company.
It is important to note that customer churn can be defined in many ways. In the GraphLab Create churn_predictor toolkit that we are going to use below, churn is defined as no activity for a fixed period of time (called the churn_period). More specifically, a user is said to have churned if he/she has no activity for a duration of time known as the churn_period (by default, this is set to 30 days).
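To make this definition concrete, here is a minimal, toolkit-independent sketch of such a check; the function name and the toy timestamps are ours, for illustration only:

import datetime as dt

def has_churned(event_times, boundary, churn_period=dt.timedelta(days=30)):
    # a user is considered churned at `boundary` if none of his/her events
    # fall inside the (boundary, boundary + churn_period] window, i.e. there
    # is no activity for `churn_period` after that time
    window_end = boundary + churn_period
    return not any(boundary < t <= window_end for t in event_times)

# toy usage with hypothetical timestamps
events = [dt.datetime(2011, 6, 15), dt.datetime(2011, 7, 2)]
print has_churned(events, dt.datetime(2011, 8, 1))  # True: no activity after Aug 1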
[Chen et al. 2012] Daqing Chen, Sai Liang Sain, and Kun Guo, "Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining", Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (published online before print: 27 August 2012, doi: 10.1057/dbm.2012.17).
The data set is available from the UCI ML Repository: http://archive.ics.uci.edu/ml/datasets/Online+Retail
In [1]:
import graphlab as gl
import graphlab.aggregate as agg
import datetime as dt
from visualization_helper_functions import *
In [2]:
purchasing_data = gl.SFrame.read_csv('./../../04.UCI.ML.REPO/00352.Online_Retail/Online_Retail.csv',
column_type_hints=[str,str,str,int,str,float,str,str])
As shown below, purchasing_data keeps the transactional data of purchases made by various customers of this online retailer during the period from 01/12/2010 to 09/12/2011. It keeps customer records per InvoiceNo and StockCode, the Quantity and the UnitPrice of each product procured, and the InvoiceDate on which the corresponding invoice was issued. The customer's country of residence, Country, as well as a short description of each StockCode, is also provided.
In [3]:
purchasing_data.head(5)
Out[3]:
In [4]:
purchasing_data.dtype
Out[4]:
Now we need to convert the InvoiceDate attribute (which is a string) into a Python datetime object.
In [5]:
def _str_to_datetime(x):
    import datetime as dt
    try:
        return dt.datetime.strptime(x, '%m/%d/%Y %H:%M')
    except (TypeError, ValueError):
        return None
In [6]:
purchasing_data['InvoiceDate'] = purchasing_data['InvoiceDate'].apply(_str_to_datetime)
purchasing_data.head(5)
Out[6]:
In [7]:
gl.canvas.set_target('ipynb')
purchasing_data.show()
Here, we have used the special SFrame functionality, .show(), to draw these summary statistics. An important advantage of this GraphLab Create method is its ability to plot large data sets with unparalleled ease. However, for reasons of general compatibility, below we utilize the matplotlib and seaborn Python libraries through the visualization_helper_functions module which we imported initially, providing some basic visualizations to support our discussion.
Note that not all the features of the purchasing_data set are useful for a univariate summary statistics analysis. We have 25900 different values recorded as InvoiceNo and more than 4000 different StockCodes and CustomerIDs, all of which are nominal categorical variables. Furthermore, the product Description cannot help our analysis in any way, and it is not meaningful to perform a univariate analysis of the datetime field, InvoiceDate.
In [8]:
purchasing_data_df = purchasing_data.to_dataframe()
print purchasing_data_df.describe(include='all')
So, by excluding the attributes Description and InvoiceNo from our consideration, and keeping aside StockCode and CustomerID (which we will draw in separate frequency plots later), the univariate summary statistics of the remaining features can be visualized as follows.
In [10]:
%matplotlib inline
univariate_summary_statistics_plot(purchasing_data,
['Quantity','UnitPrice', 'Country'],
nsubplots_inrow=2, subplots_wspace=0.5)
In [11]:
gl.Sketch(purchasing_data['CustomerID'])
Out[11]:
In [12]:
gl.Sketch(purchasing_data['StockCode'])
Out[12]:
Note that some values in this data set seem peculiar. First, some records have negative Quantity and UnitPrice values, whereas a large number of CustomerIDs are missing. We have more than 135000 such occurrences (anonymous 'X2' values) in the corresponding gl.Sketch() above. However, since we want to predict which customers are likely to churn, record lines with missing CustomerIDs cannot help our model, and we had better exclude them. The purchasing_data set now becomes:
In [13]:
purchasing_data = purchasing_data[purchasing_data['CustomerID']!='']
purchasing_data.show()
In [14]:
%matplotlib inline
univariate_summary_statistics_plot(purchasing_data,
['Quantity','UnitPrice', 'Country'],
nsubplots_inrow=2, subplots_wspace=0.5)
Note that no negative values exist in the UnitPrice attribute any more, and except for some incredibly expensive outliers, everything looks much more reasonable now.
In [16]:
gl.Sketch(purchasing_data['UnitPrice'])
Out[16]:
Note also that beyond the 99% quantile, most of the transactions are manual cancellations [Credit Notes (C-prefixed invoices) with negative values in the Quantity field] and the remaining ones are expensive products. Therefore, we had better keep all these records despite their being outliers.
In [17]:
unit_price_outliers = purchasing_data[purchasing_data['UnitPrice'] > 15.95]
unit_price_outliers.sort('UnitPrice', ascending=False)
Out[17]:
For completeness, the UnitPrice distribution beyond the 99% quantile is as follows:
In [18]:
%matplotlib inline
# transform the SFrame of interest into a Pandas DataFrame
unit_price_outliers_df = unit_price_outliers.to_dataframe()
# define seaborn style, palette, color codes
sns.set(style="whitegrid", palette="deep",color_codes=True)
# draw the UnitPrice distplot
ax = sns.distplot(unit_price_outliers_df.UnitPrice,
bins=None, hist=True, kde=False, rug=False, color='b')
or, zooming into the (15.95, 500] UnitPrice interval:
In [19]:
%matplotlib inline
# define seaborn style, palette, color codes
sns.set(style="whitegrid", palette="deep",color_codes=True)
# draw the UnitPrice distplot
ax = sns.distplot(unit_price_outliers_df[unit_price_outliers_df['UnitPrice']<500].UnitPrice,\
bins=None, hist=True, kde=False, rug=False, color='b')
Furthermore, taking a closer look at the part of the record with negative procured Quantities, we can safely assume that these cases concern product returns, discounts (StockCode='D'), and manual corrections (StockCode='M'). Therefore, keeping a complete purchasing record per customer requires keeping all these transactions.
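As a rough bookkeeping aid, one could label each negative-quantity record by its likely cause along the lines sketched here; the helper name and labels are ours, for illustration only:

def _neg_quantity_type(stock_code):
    # 'D' marks discounts and 'M' manual corrections in this data set;
    # we treat every other negative-quantity line as a product return
    if stock_code == 'D':
        return 'discount'
    elif stock_code == 'M':
        return 'manual correction'
    return 'product return'

neg_q = purchasing_data[purchasing_data['Quantity'] < 0]
print neg_q['StockCode'].apply(_neg_quantity_type).sketch_summary()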
In [20]:
purchasing_data_neg_quantity = purchasing_data[purchasing_data['Quantity'] < 0]
purchasing_data_neg_quantity.show()
In [21]:
gl.Sketch(purchasing_data_neg_quantity['Quantity'])
Out[21]:
Indeed, the customer with CustomerID='12607' procured a bunch of various products on 10/10/2011 at a total cost of $1579.51 and returned all of them two days later, canceling the previously issued invoice (InvoiceNo='570467'). Note the 'C' prefix in the InvoiceNo of the Credit Note. Of course, this specific customer is unlikely to make another purchase, a fact that we will miss if we drop the product-return lines. More importantly, we need to take this event of complete purchase cancellation seriously by appropriately updating the predictive model we will build.
In [22]:
purchasing_data_cust_12607 = purchasing_data.filter_by('12607', 'CustomerID')
purchasing_data_cust_12607[['InvoiceNo', 'StockCode', 'Description', 'Quantity',\
'InvoiceDate']].\
print_rows(num_rows=205, max_column_width=20)
In [23]:
purchasing_data_cust_12607['CostPerStockCode'] =\
purchasing_data_cust_12607['Quantity'] * purchasing_data_cust_12607['UnitPrice']
purchasing_data_cust_12607.groupby(key_columns = ['InvoiceNo', 'InvoiceDate', 'CustomerID'],
operations = {'Total Cost of Invoice':\
agg.SUM('CostPerStockCode')}).sort('InvoiceDate')
Out[23]:
As another example, the customer with CustomerID='18139' made a series of purchases on 11/21/2011 and the day after, 11/22/2011, at a total cost of $8393.22. Note that among the issued invoices, two (with InvoiceNo 'C578073' and 'C578076') involved manual correction (StockCode='M') of the paid amount. These kinds of transactions may also be important for achieving better predictive performance.
In [24]:
purchasing_data_cust_18139 = purchasing_data.filter_by('18139', 'CustomerID')
purchasing_data_cust_18139['CostPerStockCode'] =\
purchasing_data_cust_18139['Quantity'] * purchasing_data_cust_18139['UnitPrice']
purchasing_data_cust_18139[['InvoiceNo', 'StockCode', 'Description', 'Quantity',\
'InvoiceDate']].\
print_rows(num_rows=170, max_column_width=20)
In [25]:
purchasing_data_cust_18139['CostPerStockCode'] =\
purchasing_data_cust_18139['Quantity'] * purchasing_data_cust_18139['UnitPrice']
purchasing_data_cust_18139.groupby(key_columns = ['InvoiceNo', 'InvoiceDate', 'CustomerID'],
operations = {'Total Cost of Invoice':\
agg.SUM('CostPerStockCode')}).sort('InvoiceDate')
Out[25]:
In [26]:
purchasing_data_cust_18139.groupby(key_columns = 'CustomerID',
operations = {'Total Cost of Invoice':\
agg.SUM('CostPerStockCode')})
Out[26]:
In [27]:
purchasing_data.print_rows(num_rows=10, max_column_width=20)
In [28]:
purchasing_data.show()
Among the available attributes, the Description column is not going to help the model and should be excluded. Furthermore, the nominal categorical attribute InvoiceNo has too many different values (more than 22000) to be helpful for the GraphLab Create churn_predictor toolkit. However, as we saw earlier, what is important in the historical purchasing record of a customer is not the specific InvoiceNo against which he/she procured some goods but the type of the issued invoice: was it a Purchase/Standard Invoice (Invoice Receipt) or a Cancelling Invoice (Credit Note)? This type in fact determines whether an actual purchase or a product return happened, and our learning algorithm does not need anything more from this specific attribute. Therefore, we had better replace each standard InvoiceNo with an 'IR' string and each C-prefixed InvoiceNo with a 'CN' string.
In [29]:
purchasing_data.remove_column('Description')
purchasing_data['InvoiceNo'] = purchasing_data['InvoiceNo'].apply(lambda x: 'CN' if 'C' in x else 'IR')
purchasing_data.print_rows(num_rows=20, max_column_width=20)
The purchasing_data set now becomes:
In [30]:
purchasing_data.show()
In [31]:
univariate_summary_statistics_plot(purchasing_data,
['InvoiceNo','Quantity','UnitPrice','Country'],
nsubplots_inrow=2,
subplots_wspace=0.7)
In [32]:
purchasing_data.dtype
Out[32]:
The frequency plots for the multi-valued, nominal categorical variables, CustomerID and StockCode, are shown below.
In [33]:
%matplotlib inline
item_freq_plot(purchasing_data, 'CustomerID', topk=30)
In [34]:
%matplotlib inline
item_freq_plot(purchasing_data, 'StockCode', topk=30)
To better utilize our training set, a joint consideration of the two attributes InvoiceNo and StockCode would be advisable. This can be achieved by adding an extra derived attribute which concatenates the values of the other two. By doing so, we would be able to collectively take into account both the type of the issued invoice [Invoice Receipt ('IR') or Credit Note ('CN')] and the specific StockCode which was procured or returned per record line. However, as is immediately recognizable below, the percentages of distinct StockCodes appearing in the issued Credit Notes ('CN') are particularly low, so the idea of jointly considering invoice types and product codes cannot help much.
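For reference, such a derived attribute could be built as sketched below; the variable name is ours, and we do not use it further since the idea is dropped:

# sketch only: concatenate the invoice type and the product code
invoice_type_stockcode = purchasing_data['InvoiceNo'] + '_' + purchasing_data['StockCode']
print invoice_type_stockcode.head(5)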
In [35]:
%matplotlib inline
item_freq_plot(purchasing_data, 'StockCode', hue='InvoiceNo', topk=30,
pct_threshold= 0.01, reverse=True)
In [36]:
purchasing_data_cn = purchasing_data[purchasing_data['InvoiceNo']=='CN']
purchasing_data_cn.groupby(['StockCode','InvoiceNo'], agg.COUNT()).\
sort('Count', ascending=False).print_rows(num_rows=30)
Finally, we want to separate the users into train/test sets, making sure the test users are not in the training set, and create TimeSeries objects out of them.
In [37]:
(train, test) = gl.churn_predictor.\
random_split(purchasing_data, user_id='CustomerID', fraction=0.95, seed=1)
train_trial = gl.TimeSeries(train, index='InvoiceDate')
test_trial = gl.TimeSeries(test, index='InvoiceDate')
The train/test data sets have been split in such a way that they contain different customers (CustomerIDs). This is because we want our learning algorithm to be blind to the data set which we are going to use for its validation and final evaluation.
In [38]:
# load NumPy Library
import numpy as np
# verify that train and test sets have different CustomerIDs
train_customer_ids = train['CustomerID'].unique().sort().to_numpy()
test_customer_ids = test['CustomerID'].unique().sort().to_numpy()
print 'Number of same \'CustomerID\'s in test and train set: %d' %\
len(test_customer_ids[np.in1d(test_customer_ids, train_customer_ids, assume_unique=True)])
In [39]:
print "Train Start date : %s" % train_trial.min_time
print "Train End date : %s" % train_trial.max_time
In [40]:
print "Test Start date : %s" % test_trial.min_time
print "Test End date : %s" % test_trial.max_time
In [41]:
userdata = gl.SFrame('./userdata_sf/')
userdata
Out[41]:
In [42]:
userdata.dtype
Out[42]:
We also change the dtype of the nominal categorical attribute CustomerID to str.
In [43]:
userdata['CustomerID'] = userdata['CustomerID'].astype(str)
Here, we train the Churn Predictor model. We determine the period of inactivity (churn_period_trial) after which a customer will be considered churned. Furthermore, in order to extract more signal out of the data, we choose multiple churn boundaries for the same users. Remember that both the train and test TimeSeries start on 2010-12-01 and end on 2011-12-09.
In [41]:
## churn period of inactivity
churn_period_trial = dt.timedelta(days = 30)
## In order to extract more signal out of the data,
## Define multiple churn boundaries for the same users.
churn_boundary_aug = dt.datetime(year = 2011, month = 8, day = 1)
churn_boundary_sep = dt.datetime(year = 2011, month = 9, day = 1)
churn_boundary_oct = dt.datetime(year = 2011, month = 10, day = 1)
In [42]:
train.head(10)
Out[42]:
In [44]:
## train the Churn Predictor model
model = gl.churn_predictor.create(train_trial,
user_id = 'CustomerID',
features = None,
user_data = userdata,
churn_period = churn_period_trial,
# The time-scale/granularity
# at which features are computed (1 day here)
time_period = dt.timedelta(1),
# The various multiples of `time_period` used
# while computing features
lookback_periods = [7, 14, 21, 60, 90],
# Multiple churn boundaries to extract more signal out of data
time_boundaries = [churn_boundary_aug, churn_boundary_sep,\
churn_boundary_oct],
use_advanced_features=True)
In [30]:
# Evaluate this model in October
evaluation_time = churn_boundary_oct
In [31]:
# define cutoffs to consider
cutoffs_list = np.linspace(0.1, 1.0, num=100)
cutoffs_list = cutoffs_list.tolist()
# calculate metrics using test set
metrics = model.evaluate(test_trial, evaluation_time,\
user_data=userdata, cutoffs=cutoffs_list)
In [32]:
print(metrics)
In [33]:
def plot_pr_curve(precision, recall, title):
    import matplotlib.pyplot as plt
    plt.rcParams['figure.figsize'] = 7, 5
    plt.plot(precision, recall, 'g-', linewidth=2.0)
    plt.plot(precision, recall, 'bo')
    plt.title(title)
    plt.xlabel('Precision')
    plt.ylabel('Recall')
    plt.rcParams.update({'font.size': 16})

def plot_roc_curve(fpr, tpr, title):
    import matplotlib.pyplot as plt
    plt.rcParams['figure.figsize'] = 7, 5
    plt.plot(fpr, tpr, 'g-', linewidth=2.0)
    plt.title(title)
    plt.xlabel('FP Rate')
    plt.ylabel('TP Rate')
    plt.rcParams.update({'font.size': 16})
In [34]:
%matplotlib inline
precision_all = metrics['precision_recall_curve']['precision']
recall_all = metrics['precision_recall_curve']['recall']
plot_pr_curve(precision_all, recall_all, 'Precision recall curve (Churn Predictor)')
In [35]:
%matplotlib inline
fpr = metrics['roc_curve']['fpr']
tpr = metrics['roc_curve']['tpr']
plot_roc_curve(fpr, tpr, 'ROC Curve (Churn Predictor)')
print 'AUC: %.5f' % metrics['auc']
Of course, there is ample room for improvement if we carefully tweak the hyperparameters of the boosted trees classifier, which is the underlying classifier of the GraphLab Create churn_predictor ML toolkit.
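As a rough sketch of one such direction, we could retrain a boosted trees classifier by hand on the engineered training data (shown further below); note that the target column name 'churn' and the hyperparameter values here are assumptions, not verified against the toolkit's internals:

# sketch: hand-tuned boosted trees on the processed training data;
# using 'churn' as the engineered target column is an assumption
engineered = model.processed_training_data
custom_bt = gl.boosted_trees_classifier.create(engineered,
                                               target='churn',
                                               max_iterations=20,
                                               max_depth=8,
                                               step_size=0.1,
                                               validation_set=None)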
One can even obtain an interactive overview of the trained model as evaluated on the test set, as shown below. For details, consult Dato's User Guide for Churn Prediction Models.
In [36]:
overview = model.views.overview(test_trial, evaluation_time, user_data=userdata)
In [37]:
overview.show()
To get a feeling for the outcome of this call, we provide below screen snapshots of both the Explore and Evaluate views of this interactive feature. In the Explore view, one can better investigate how the customers have been clustered into different segments, the corresponding average predicted probability to churn, and the reasoning why this might happen for the customers of each segment.
In the Evaluate view shown below, one can interactively investigate the Precision and Recall values returned by the model for a given Churn Probability Threshold (the default value is a 50% probability). The application also returns a histogram of percentages of the actual churned users with regard to the predicted ones, binned across the whole spectrum of the Predicted Churn Probability. Please note that this feature is currently in beta and should be used with caution and not in mission-critical applications.
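The same numbers can also be read off programmatically from the metrics computed earlier. A minimal sketch of looking up the precision/recall rows nearest the default 50% threshold, assuming the curve's cutoff column is named 'cutoffs':

# sketch: precision/recall near the 50% churn-probability threshold
pr_curve = metrics['precision_recall_curve']
print pr_curve[(pr_curve['cutoffs'] > 0.49) & (pr_curve['cutoffs'] < 0.51)]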
The goal of a churn prediction model is to predict the probability that a user will have no activity for a churn_period of time in the future. First, we set a specific time (prediction_time) at which the predictions must be made. Then we can obtain the desired predictions by calling the .predict() method of the trained model.
In [38]:
prediction_time = dt.datetime(2011, 9, 1)
predictions = model.predict(test_trial, time_boundary=prediction_time, user_data=userdata)
predictions.print_rows()
To get the 20 customers most likely to churn:
In [39]:
predictions.sort('probability', ascending=False).print_rows(num_rows=20)
Note:
During the prediction process, you may get a message
Not enough data to make predictions for 60 user(s).
as happens above. This might happen for several reasons, some of which are:
- The user has no recorded activity before the prediction_time.
- The user has no activity during the lookback_periods (the periods of time used during the feature engineering process). The model looks at the usage history in the recent past to make a forecast. If no activity was present during this recent past, then predictions cannot be made, and a prediction of None is returned in this case.
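To see which users fall into this category, one can filter the returned predictions for missing probabilities; a small sketch:

# sketch: users for which no prediction could be made (None probability)
no_prediction = predictions[predictions['probability'] == None]
print 'Users without prediction: %d' % len(no_prediction)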
In [40]:
explanations = model.explain(test_trial, time_boundary=prediction_time, user_data=userdata)
To obtain the list of customers with a probability higher than 90% to churn:
In [41]:
print explanations[explanations['probability'] > 0.9]
To get the explanation returned by the model for, let's say, CustomerID='12427':
In [42]:
def get_explanation(cust_id, explanations):
    print 'Churn Explanation for CustomerID: %s' % cust_id
    print '------------------------------------------'
    for explanation in explanations[explanations['CustomerID']==cust_id]['explanation'][0]:
        print explanation
In [43]:
get_explanation('12427', explanations)
And if we want to obtain the list of customers with a probability to churn greater than 50% but less than 60%:
In [44]:
print explanations[(explanations['probability'] > 0.5) & (explanations['probability'] < 0.6)]
Assuming we want to gain greater insight into CustomerID='15265', we can make the call:
In [45]:
get_explanation('15265', explanations)
It is remarkable how different the explanation returned for this case is from the previous one.
Finally, one can even obtain a detailed churn report which clusters the users into segments depending on the probability and the reason for churn.
In [46]:
report = model.get_churn_report(test_trial, time_boundary=prediction_time, user_data=userdata)
The detailed churn report returned by the model we trained, sorted by descending average probability, is listed below. It consists of 32 different segments (segment_id) of customers, depending on their average probability to churn (avg_probability) and the reasoning (explanation) why this might happen. The model also returns the specific CustomerIDs per customer segment.
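To pull out the customers of a single segment programmatically, one could do something like the following sketch; the 'users' column name is an assumption based on the report description above:

# sketch: the CustomerIDs of one segment ('4' is explored further below)
segment_4 = report[report['segment_id'] == '4']
print segment_4['users'][0]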
In [55]:
report.sort('avg_probability', ascending=False).print_rows(num_rows=35, max_column_width=15)
Finally, let's take a closer look at the event log of a customer who has been predicted to churn with high probability. More specifically, let's check the time series of events of CustomerID='13211', found in segment_id='4' of the test_trial set and predicted to churn with an avg_probability = 0.903421559504.
In [61]:
eventlog_cust_13211 = test_trial.filter_by('13211', 'CustomerID')
eventlog_cust_13211[['InvoiceNo', 'StockCode', 'Description', 'Quantity',\
'InvoiceDate']].print_rows(num_rows=60, max_column_width=25)
Note that this specific customer (with CustomerID='13211') bought a large number of products on September 5th, 2011, and on November 29th of the same year, but finally ended her relationship with the online retailer. The explanation returned by the model for all 14 customers found in segment_id='4' of the report above was:
In [62]:
report[report['segment_id']=='4']['explanation'][0]
Out[62]:
In [63]:
model.trained_model
Out[63]:
To get the training data after feature engineering:
In [64]:
train_data = model.processed_training_data
train_data.print_rows(max_row_width=100)
Finally, to obtain the most important features for this model:
In [65]:
model_feature_importance = model.trained_model.get_feature_importance()
print model_feature_importance
In [66]:
gl.canvas.set_target('ipynb')
model_feature_importance['name'].show()
For details of how the GraphLab Create Churn Prediction toolkit works, consult Dato's User Guide on the Churn Prediction toolkit internals and the Dato Blog post by Antoine Atallah, Software Engineer and Data Scientist at Dato. Details of the API functionality which is currently available can be found in the corresponding GraphLab Create API Documentation.